Latent Process Decomposition of High-Dimensional Count Data
Abstract
Motivation: Next-generation sequencing (NGS) technologies have become the preferred way of exploring a genome. The resulting data are high-dimensional discrete counts with correlated variables (e.g., genes). We present a novel latent factor model for high-dimensional count data, Latent Process Decomposition (LPD-C), that accounts for the correlations among genes and models the biological hypothesis that genes work in groups (e.g., pathways), referred to as processes. LPD-C is a two-stage unsupervised approach for grouping genes into a pre-specified number of clusters and for selecting the genes that belong to these clusters with high probability. The first stage uses a variational Bayes approach for efficient parameter estimation. The second stage selects the gene subsets using empirical Bayes hypothesis testing.
Results: We explore the performance of LPD-C on simulated and publicly available NGS data, compare it with existing approaches, and show that it is a useful and extensible framework for identifying genes suitable for further exploration. Although we apply LPD-C in a genomic context, it can be used for any high-dimensional count data.
Availability: R code for fitting LPD-C is available from the authors on request.
Contact: [email protected]
Similar resources
Latent Process Decomposition of High-Dimensional Count Data
Next-generation sequencing (NGS) technologies have become the preferred way of exploring a genome. These data are high-dimensional discrete counts with latent structure that, once revealed, will reduce the dimensions and will lead to the subsets of genes that are suitable for further exploration. Latent Process Decomposition of high-dimensional count data (LPD-C) is presented as a two stage app...
Distributed Latent Dirichlet Allocation on Spark via Tensor Decomposition
Learning latent variable mixture models in high-dimension is applicable in numerous domains where low dimensional latent factors out of the high-dimensional observations are desired. Popular likelihood based methods optimize over a non-convex likelihood which is computationally challenging to achieve due to the high-dimensionality of the data, and therefore it is usually not guaranteed to conve...
Using multivariate generalized linear latent variable models to measure the difference in event count for stranded marine animals
BACKGROUND AND OBJECTIVES: The classification of marine animals as protected species makes data and information on them very important. This creates a need to retrieve and understand the data on event counts for stranded marine animals based on location of emergence, number of individuals, behavior, and threats to their presence. Whales are g...
High-dimensional neural spike train analysis with generalized count linear dynamical systems
Latent factor models have been widely used to analyze simultaneous recordings of spike trains from large, heterogeneous neural populations. These models assume the signal of interest in the population is a low-dimensional latent intensity that evolves over time, which is observed in high dimension via noisy point-process observations. These techniques have been well used to capture neural corre...
Discovery of Latent Factors in High-dimensional Data Using Tensor Methods
Dissertation by Furong Huang, Doctor of Philosophy in Electrical and Computer Engineering, University of California, Irvine, 2016. Assistant Professor Animashree Anandkumar, Chair. Unsupervised learning aims at the discovery of hidden structure that drives the observations in the real world. It is ...
Publication date: 2013